[SOUND].
This lecture is about
a statistical language model.
In this lecture,
we're going to give an introduction
to statistical language models.
This has to do with how we model
text data with probabilistic models.
So, it's related to how we model
a query based on a document.
We're going to talk about, what is
a language model and, then, we're going to
talk about the simplest language model
called a unigram language model.
Which also happens to be the most
useful model for text retrieval.
And finally, we'll discuss
possible uses of a language model.
What is a language model?
Well, it's just a probability
distribution over word sequences.
So, here I show one.
This model gives the sequence "Today is
Wednesday" a probability of 0.001. It gives
"Today Wednesday is" a very, very small
probability, because it's ungrammatical.
You can see the probabilities
given to these sentences or
sequences of words can vary
a lot depending on the model.
Therefore, it's clearly context-dependent.
In ordinary conversation,
"Today is Wednesday" is probably the most
popular among these sentences.
But imagine in the context of
discussing applied math,
maybe "the eigenvalue is positive"
would have a higher probability.
This means it can be used to
represent the topic of a text.
The model can also be regarded
as a probabilistic mechanism for
generating text, and this is why it
is often called a generative model.
So, what does that mean?
We can imagine this is
a mechanism that's visualized
here as a [INAUDIBLE] system that
can generate sequences of words.
So we can ask for a sequence, and
that's to sample a sequence
from the device, if you want.
And it might generate, for
example, today is Wednesday, but
it could have generated
many other sequences.
So for example,
there are many possibilities, right?
So, in this sense,
we can view our data as basically a sample
observed from such a generative model.
So why is such a model useful?
Well, it's mainly because it can quantify
the uncertainties in natural language.
Where do uncertainties come from?
Well, one source is
simply the ambiguity in
natural language that we
discussed earlier in the lecture.
Another source is that we don't
have complete understanding.
We lack all the knowledge needed
to understand language.
In that case there will
be uncertainties as well.
So let me show some examples of questions
that we can answer with a language model
that would have interesting
applications in different ways.
Given that we see "John" and "feels",
how likely will we see
"happy" as opposed to "habit"
as the next word in a sequence of words?
Obviously this would be very useful for
speech recognition, because "happy" and
"habit" would have similar
acoustic signals.
But if we look at the language model
we know that John feels happy would be
far more likely than John feels habit.
Another example: given that we
observe "baseball" three times and
"game" once in a news article,
how likely is it about sports?
This obviously is related to text
categorization and information retrieval.
Also, given that a user is
interested in sports news,
how likely would the user
use baseball in a query?
Now this is clearly related to the query likelihood
that we discussed in the previous lecture.
So now let's look at
the simplest language model,
called a unigram language model.
In such a case,
we assume that we generate the text by
generating each word independently.
So this means the probability of
a sequence of words would be then
the product of
the probability of each word.
Now normally they are not independent,
right?
So if you have seen a word like "language",
that will make it far more
likely to observe "model"
than if you haven't seen "language".
So this assumption is
not necessarily true, but
we'll make this assumption
to simplify the model.
So now, the model has precisely N
parameters, where N is the vocabulary size.
We have one probability for each word, and
all these probabilities must sum to 1.
So strictly speaking,
we actually have N minus 1 parameters.
As I said,
text can then be assumed to be a sample
drawn from this word distribution.
So for example,
now we can ask the device, or the model,
to stochastically generate words for
us instead of sequences.
So instead of giving a whole
sequence like today is Wednesday,
it now gives us just one word.
And we can get all kinds of words.
And we can assemble these
words in a sequence.
So, that would still allow you to
compute the probability of "Today is Wednesday"
as the product of the three word probabilities.
As you can see, even though we have
not asked the model to generate
the sequences, it actually allows
us to compute the probability for
all the sequences.
But this model now only needs
N parameters to characterize.
That means if we specify
all the probabilities for
all the words then the model's
behavior is completely specified.
Whereas if we don't make this
assumption, we would have to specify
probabilities for all kinds of
combinations of words in sequences.
So making this assumption makes it
much easier to estimate these parameters.
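To make the independence assumption concrete, here is a minimal sketch of how a unigram model scores a sequence. The vocabulary and the probability values are made up for illustration and are not taken from the lecture slides; the function name `sequence_probability` is likewise hypothetical.

```python
import math

# Hypothetical unigram model: one probability per word, summing to 1.
# These numbers are illustrative only.
unigram = {"today": 0.2, "is": 0.3, "wednesday": 0.1, "the": 0.4}

def sequence_probability(words, model):
    """Multiply the per-word probabilities (independence assumption).

    A word missing from the model is treated as having probability 0.
    """
    prob = 1.0
    for w in words:
        prob *= model.get(w, 0.0)
    return prob

# The N probabilities must sum to 1, so only N - 1 of them are free.
assert math.isclose(sum(unigram.values()), 1.0)

p = sequence_probability(["today", "is", "wednesday"], unigram)
p_reordered = sequence_probability(["today", "wednesday", "is"], unigram)
print(p, p_reordered)
```

One thing this sketch makes visible: because only the individual words matter, a unigram model gives a reordered sequence the same probability (up to floating-point rounding) as the original one. That is a direct consequence of the simplifying assumption.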
So let's see a specific example here.
Here I show two unigram language
models with some probabilities, and
these are high probability
words that are shown on top.
The first one clearly suggests
the topic of text mining
because the high probability words
are all related to this topic.
The second one is more related to health.
Now, we can then ask the question of how
likely we are to observe a particular text
from each of these two models.
Now suppose we sample
words to form a document.
Let's say we take the first distribution
and sample words from it.
What words do you think
would be generated? Maybe "text",
or maybe "mining", maybe another word.
Even food,
which has a very small probability,
might still be able to show up.
But in general, high probability
words will likely show up more often.
So we can imagine a generated
text that looks like text mining.
In fact, with a small probability,
you might be able to actually generate
an actual text mining paper that
would actually be meaningful, although
the probability would be very, very small.
In the extreme case, you might imagine
we might be able to generate
a text mining paper that
would be accepted by a major conference.
And in that case the probability
would be even smaller.
But it's still a nonzero probability,
if we assume none of the words
has a zero probability.
Similarly from the second topic,
we can imagine we can generate a food and
nutrition paper.
That doesn't mean we cannot generate this
paper from text mining distribution.
We can, but the probability would be very,
very small, maybe smaller than even
generating a paper that can be accepted
by a major conference on text mining.
So the point here is
that given a distribution,
we can talk about the probability of
observing a certain kind of text.
Some text would have higher
probabilities than others.
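The sampling process described above can be sketched in a few lines. The word probabilities below are assumptions chosen for illustration (the transcript does not give the slide's numbers), and `"<other>"` is a hypothetical placeholder that absorbs the probability mass of the rest of the vocabulary.

```python
import random

# Illustrative "text mining" topic model; values are assumptions.
text_mining_model = {
    "text": 0.2,
    "mining": 0.1,
    "clustering": 0.02,
    "association": 0.01,
    "food": 0.00001,
}
# Lump the remaining probability mass into one placeholder token.
text_mining_model["<other>"] = 1.0 - sum(text_mining_model.values())

def sample_document(model, length, rng):
    """Draw `length` words independently from the word distribution."""
    words = list(model)
    weights = [model[w] for w in words]
    return rng.choices(words, weights=weights, k=length)

rng = random.Random(0)  # fixed seed so the run is repeatable
doc = sample_document(text_mining_model, 20, rng)
print(" ".join(doc))
```

High-probability words such as "text" should show up often in the sample, while "food" can appear but will do so very rarely.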
Now, let's look at the problem
in a different way.
Suppose we now have
available a particular document.
In this case, maybe the abstract of
a text mining paper, and
we see these word counts here.
The total number of words is 100.
Now the question we ask here
is an estimation question.
We can ask the question: which model,
which word distribution, has been used
to generate this text?
Assuming the text has been generated by
assembling words from the distribution.
So what would be your guess?
We have to decide what probabilities
"text", "mining", et cetera would have.
So pause the video for a second and
try to think about your best guess.
If you're like a lot of people
you would have guessed that well,
my best guess is text has
a probability of 10 out of 100
because I have seen text ten times and
there are a total of 100 words.
So we simply
normalize these counts.
And that's in fact [INAUDIBLE] justified.
And your intuition is consistent
with mathematical derivation.
And this is called the maximum
likelihood estimator.
In this estimator,
we assume that the parameter settings
are those that would give our
observed data the maximum probability.
That means if we change
these probabilities,
then the probability of observing the
particular text would be somewhat smaller.
So we can see this has
a very simple formula.
Basically, we just need to
look at the count of a word
in the document and then divide it by
the total number of words in the document.
That is, the document length.
We normalize the frequency.
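The estimator just described (count of a word divided by document length) can be sketched as follows. The toy document is an assumption: only the figure of "text" occurring 10 times in 100 words comes from the lecture; the count for "mining" and the `"filler"` token are invented to pad the document to 100 words.

```python
from collections import Counter

def mle_unigram(tokens):
    """Maximum likelihood estimate: relative frequency of each word."""
    if not tokens:
        raise ValueError("cannot estimate a model from an empty document")
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

# Toy 100-word document; everything except the count of "text" is made up.
tokens = ["text"] * 10 + ["mining"] * 5 + ["filler"] * 85
model = mle_unigram(tokens)

print(model["text"])           # 10 / 100
print(model.get("food", 0.0))  # unseen word: no probability mass at all
```

Note that a word that never occurs in the document simply has no entry, which is the same as assigning it probability zero.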
Well, a consequence of this
is, of course, that we're going to assign
zero probabilities to unseen words.
If we have an unobserved word,
there will be no incentive to assign
a nonzero probability using this approach.
Why?
Because that would take away probability
mass from the observed words.
And that obviously wouldn't
maximize the probability of this
particular observed [INAUDIBLE] data.
But one can still question whether
this is our best estimator.
Well, the answer depends on what kind
of model you want to find, right?
This estimator gives the best model
based on this particular data.
But if you're interested in a model
that can explain the content of the full
paper for this abstract, then you
might have second thoughts, right?
So for one thing, there should be other
words in the body of that article.
So they should not have
zero probabilities,
even though they are not
observed in the abstract.
We're going to cover this later, in
discussing the query likelihood model.
So, let's take a look at some possible
uses of these language models.
One use is simply to use
it to represent the topics.
So here I show some general
English background text.
We can use this text to
estimate a language model.
And the model might look like this.
Right?
So on the top we'll have all those common
words, like "is" and "we", and then we'll
see some less common words like these,
and then some very,
very rare words at the bottom.
This is the background language model.
It represents the frequency of words
in English in general, right?
This is the background model.
Now, let's look at another text.
Maybe this time, we'll look at
Computer Science research papers.
So we have a collection of computer
science research papers. We do the estimation again;
we can just use the maximum likelihood estimator,
where we simply normalize the frequencies.
Now, in this case, we look at
the distribution, that looks like this.
On the top, it looks similar,
because these words occur everywhere,
they are very common.
But as we go down we'll see words that
are more related to computer science.
Computer, or software, or text et cetera.
So, although in the background model we might
also see these words, for example, "computer",
we can imagine the probability there
is much smaller than the probability here.
And we will see many
other words here that
would be more common
in general in English.
So, you can see this distribution
characterizes a topic
of the corresponding text.
We can look at an even smaller text.
So, in this case let's look
at a text mining paper.
Now if we do the same, we have another
distribution. Again, "the" can be
expected to occur on the top.
Soon we will see "text", "mining",
"association", "clustering".
These words have relatively
high probabilities. In contrast,
in this distribution, "text" has
a relatively small probability.
So this means,
again, based on different text data
we can have a different model.
And the model captures the topic.
So we call this a document language model, and
we call this a collection language model.
And later, we'll see how they're
used in a retrieval function.
But now, let's look at
another use of this model.
Can we statistically find what words
are semantically related to computer?
Now how do we find such words?
Well, our first thought is, well, let's
take a look at the text that matches
"computer".
So we can take a look at all the documents
that contain the word computer.
Let's build a language model.
Okay, see what words we see there.
Well, not surprisingly, we see these
common words on top as we always do.
So in this case,
this language model gives us the
conditional probability of seeing
a word in the context of "computer".
And these common words will
naturally have high probabilities.
Also, we'll see that "computer" itself and
"software" will have relatively
high probabilities.
But
if we just use this model, we cannot
just say all these words
are semantically related to "computer".
So intuitively, we'd like to get
rid of these common words.
How can we do that?
It turns out that it's possible
to use a language model to do that.
Now I suggest you think about that.
So how can we know what
words are very common so
that we want to kind of get rid of them.
What model will tell us that?
Well, maybe you can think about that.
So the background language model
precisely tells us this information.
It tells us what words
are common in general.
So if we use this background model,
we would know that these words
are common words in general.
So it's not surprising to observe
them in the context of computer.
Whereas "computer" has a very
small probability in general.
So it's very surprising that we have
seen "computer" with this probability.
And the same is true for software.
So then we can use these two
models to somehow figure out
the words that are related to "computer".
For example, we can simply take the ratio
of these two probabilities, that is, normalize
the topic language model by the probability
of the word in the background model.
So if we do that, we take the ratio,
we'll see that on the top,
"computer" is ranked first, and
then followed by software,
program, all these words
related to computer.
Because they occur very frequently
in the context of "computer", but
not frequently in the whole collection.
Whereas these common words will
not have a high ratio.
In fact,
they have a ratio of about one down there,
because they are not really
related to "computer".
By taking the sample of text
that contains "computer", we don't
really see more occurrences
of them than in general.
So this shows that even
with these simple language models,
we can do some limited
analysis of semantics.
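The ratio idea can be sketched as below. All probabilities are invented for illustration, and each dictionary is only an excerpt of a much larger distribution, so the values do not sum to 1. The helper name `rank_by_ratio` is hypothetical.

```python
def rank_by_ratio(topic_model, background_model):
    """Rank words by p(word | topic) / p(word | background).

    Words with no (or zero) background probability are skipped to avoid
    dividing by zero; a real system would smooth the background model.
    """
    ratios = {
        word: prob / background_model[word]
        for word, prob in topic_model.items()
        if background_model.get(word, 0.0) > 0.0
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)

# Excerpts of two hypothetical models (values are assumptions).
background = {"the": 0.03, "a": 0.02, "is": 0.015,
              "computer": 0.00001, "software": 0.00001}
computer_context = {"the": 0.032, "a": 0.019, "is": 0.014,
                    "computer": 0.004, "software": 0.001}

ranking = rank_by_ratio(computer_context, background)
for word, ratio in ranking:
    print(f"{word}\t{ratio:.2f}")
```

With these made-up numbers, "computer" and "software" come out on top with large ratios, while the common words end up with ratios close to one, which matches the behavior described in the lecture.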
So in this lecture,
we talked about, language model,
which is basically a probability
distribution over the text.
We talked about the simplest language
model, called the unigram language model,
which is also just a word distribution.
We talked about the two
uses of a language model.
One is to represent the topic in
a document, in a collection, or in general.
The other is to discover word associations.
In the next lecture, we're
going to talk about how
language models can be used to
design a retrieval function.
Here are two additional readings.
The first is a textbook on statistical
natural language processing.
The second is an article that has
a survey of statistical language
models with other pointers
to research work.
[MUSIC]

